Abstract

Traditional machine-learned ranking systems for web search are often trained to capture
stationary relevance of documents to queries, which limits their ability to track non-stationary
user intention in a timely manner. In recency search, for instance, the relevance of documents
to a query on breaking news often changes significantly over time, requiring effective adaptation
to user intention. In this paper, we focus on recency search and study a number of algorithms
to improve ranking results by leveraging user click feedback. Our contributions are threefold.
First, we use real search sessions collected in a random exploration bucket for reliable
offline evaluation of these algorithms, which provides an unbiased comparison across algorithms
without online bucket tests. Second, we propose a re-ranking approach to improve search results
for recency queries using user clicks. Third, our empirical comparison of a dozen algorithms
on real-life search data suggests the importance of a few algorithmic choices in these applications,
including generalization across different query-document pairs, specialization to popular queries,
and real-time adaptation to user clicks.

1 Introduction

Ranking a list of documents based on their relevance to a given query is the central problem in
various search applications on the Internet. Machine-learned ranking algorithms have been shown
to be highly effective at generalizing to unseen data from labeled training data and have been very
successful, especially in commercial Web search engines; see [18], [3], [?], [12], [6], [30], [23] and the many
references therein. Usually, such machine-learned ranking algorithms learn
a ranking function based on editorial judgments, that is, relevance labels provided by human editors.
A critical assumption here is that the relevance of documents for a given query is more or less
stationary over time, and therefore, as long as the coverage of the training set is broad enough, the
ranking function learned from the training set will be sufficient to generalize to unseen data in
the future. This assumption is often valid in web search, especially for popular queries like "yahoo",
for which document relevance is indeed (almost) static.

However, there are other important categories of applications where document relevance to a
query may change over time. One such example is the recency ranking problem in web search: when
breaking news emerges, a document that used to be most relevant to a query may be superseded by
others that have more relevant information about the news; see Section [3] for a concrete example.
A key challenge for such problems is to track user intention in a timely fashion.

An interesting attempt was made recently at tracking non-stationary document relevance [8].
The authors devised time-varying features that reflect the freshness of documents and utilized
recency-demoted labels provided by human editors that explicitly modify the relevance target values in the
training set. Their results showed improved ranking quality for time-sensitive queries.
However, their approach is still based on editorial judgments and is therefore limited for two reasons. First,
obtaining high-quality training data is hard. Implementing more fine-grained time-varying features,
such as features from the time series of clicks that can accurately follow relevance drifts, is
subtle and complex, since carefully testing and selecting good features is a long and
complicated process. Also, obtaining recency-demoted labels from human editors is not
only laborious and expensive, but can also be inaccurate in representing the temporal variation of
document relevance. Second, even when such complex and expensive data are available to
batch-train a ranking function, tracking actual user intention remains challenging due to the
unpredictable nature of how user intention evolves over time.

In this paper, we investigate how to leverage user click feedback to complement and improve 
such editorial-judgment based ranking systems. Our rationale is that, particularly for recency 
queries, instantaneous click trends on the top portion of the ranking list are important indicators 
of document relevance. Such signals allow us to extract subtle information that may be hard for 
human editors to foresee when they provide relevance judgments. In particular, we explicitly track 
the click-through rate (CTR) of a query-document pair using a linear combination of extracted 
features, including the editorial-judgment based ranking function's score. Based on search results 
returned by the current search engine, we propose a re-ranking approach to further improve search 
results for recency queries. We use user clicks as labels for training the CTR models in either batch
or online mode. 

In order to reliably evaluate and compare our algorithms, an "exploration bucket" was set up for
a small random portion of live traffic for recency-classified queries in a commercial search engine.
Within the bucket, the top URLs returned by the search engine were randomly shuffled. This bucket
is critical to the work reported in this paper for two reasons. First, it provides a mechanism for
exploration, which is essential for interactive learning problems such as the one considered in this paper
in order to remove evaluation bias (Section [2]). Second, it allows us to obtain an unbiased evaluation of
algorithms without the need for online bucket tests (Section [5.1]).

This work extensively augments our preliminary results reported in an extended abstract [24]
and uses random exploration data for an unbiased evaluation of a dozen algorithms that improve
ranking by leveraging user click feedback on a major web search engine.

The rest of the paper is organized as follows. We describe our exploration bucket in Section [2].
Using data in this bucket, we present a motivating example in Section [3], showing the necessity of
taking into account the temporal variation of document relevance reflected in user click feedback. Our
methods are detailed in Section [4] and empirically evaluated using the exploration bucket data in
Section [5]. We then discuss related work in Section [6] and conclude the paper in Section [7].

2 Exploration Bucket Data 

As described in the previous section, we set up a bucket to collect exploration data from a small
portion of live traffic from a commercial search engine. The bucket started on Jan 29, 2010 and
ended on Feb 4, 2010. Over these days, we collected 399,880 search sessions that contained
61,904 recency-classified queries, after removing non-random sessions corrupted by business rules.
The ranked lists for those queries were generated by the recency ranking function trained as described
in [8], and the ranking score for each query-document pair was recorded. For each session, we randomly
shuffled the top four results and logged the permutation id of each shuffled permutation (a total of
4! = 24 of them) and user clicks on the corresponding permuted ranking results.
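
The shuffling and logging step can be illustrated with a minimal sketch. How the production system assigned permutation ids is not stated in the paper; the enumeration order of `itertools.permutations` used below, and the helper names `shuffle_top_four` and `original_rank`, are assumptions made only for illustration.

```python
import itertools
import random

# All 4! = 24 orderings of the top-four slots. The list index is used as a
# hypothetical permutation id (the real id scheme is not described).
PERMUTATIONS = list(itertools.permutations(range(4)))


def shuffle_top_four(ranked_urls, rng):
    """Shuffle the top four results uniformly at random.

    Returns (permutation_id, displayed_urls); results below rank four
    keep their original order.
    """
    perm_id = rng.randrange(len(PERMUTATIONS))
    perm = PERMUTATIONS[perm_id]
    shuffled_top = [ranked_urls[i] for i in perm]
    return perm_id, shuffled_top + list(ranked_urls[4:])


def original_rank(perm_id, position):
    """Recover the original 0-based rank of the URL shown at `position`."""
    return PERMUTATIONS[perm_id][position]
```

Logging only the permutation id is enough to reconstruct, for every displayed position, which originally ranked URL was shown there, which is what the positional analysis later in this section relies on.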

The collected data is very sparse and long-tailed as shown in Figure [1], in which 92.4% of queries
were issued no more than 10 times and more than half of queries were issued just once. The reason 
for this sparsity is that the recency query classifier utilizes some language model to determine the 
queries that are related to each other, which causes some recency-related idiosyncratic, less popular 
queries — such as different word orderings or typographically wrong queries — to be classified as 
recency queries. 

By doing the random shuffling, we are able to collect user click feedback on each document
without positional bias, and such feedback can be thought of as a reliable proxy for the relevance of
documents. Note that the effect of shuffling on user experience would not be as severe as that
for navigational queries, since the relevance differences of top-ranked documents to recency queries
would not be as dramatic as those for navigational queries. Also, we shuffled only a reasonably small
number of results, four, in order to limit the negative impact on user experience in the exploration bucket.

Figure 1: Recency query impression.

Another byproduct of our exploration bucket is that we can accurately observe the positional
biases of clicks. Specifically, for the top four URLs in each session, we can infer the original ranking
of the search engine simply from the recorded ranking scores. For each session, the URL with the highest
ranking score is called the "1st URL". These URLs were displayed an equal number of times in all
four positions in the exploration bucket because of the random shuffling. We can then estimate the
aggregate click-through rate (CTR) of the 1st URLs in each of the four positions, as depicted by
the blue line in Figure [2]. Such nCTRs are marginal since we have taken all possible layouts (of
other URLs) into account, thanks to the uniform randomness in the exploration bucket. Similarly,
marginal nCTRs of the 2nd, 3rd, and 4th URLs of all sessions are also plotted in Figure [2].
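
As a rough illustration of how such marginal CTRs could be computed from the shuffled sessions, consider the sketch below. The session representation (a tuple of original ranks per displayed position plus a tuple of 0/1 clicks) and the function name `marginal_ctr` are assumptions for this example, not the paper's log format.

```python
from collections import defaultdict


def marginal_ctr(sessions):
    """Estimate the CTR of the k-th originally ranked URL at each position.

    `sessions` is an iterable of (original_ranks, clicks), where
    original_ranks[p] is the original rank of the URL displayed at
    position p and clicks[p] is 1 if that URL was clicked, else 0.
    Returns a dict {(original_rank, position): ctr}.
    """
    shows = defaultdict(int)
    clicked = defaultdict(int)
    for original_ranks, clicks in sessions:
        for pos, (rank, click) in enumerate(zip(original_ranks, clicks)):
            shows[(rank, pos)] += 1
            clicked[(rank, pos)] += click
    # Averaging over all sessions marginalizes over the layout of the
    # other URLs, which is valid because the shuffling is uniform.
    return {key: clicked[key] / shows[key] for key in shows}
```

Fixing the original rank and varying the position in the returned dictionary yields one line of Figure [2] (up to the normalization constant).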

Interestingly, the lines of these four marginal nCTRs are almost parallel to each other, which implies
that the user click patterns follow the well-known power-law distribution. The slope indicates
the intrinsic positional biases in the displayed layout of search results. To further illustrate the
conditional effect on user click patterns, we also present the nCTRs of the original display order.
It corresponds to the steeper straight line of Figure [2], indexed by "Control" in cyan. We
observed that the nCTR of the 2nd URL at the 2nd position conditioned on the 1st URL at the
1st position is much lower than the marginal nCTR of the 2nd URL at the 2nd position. The
drop indicates a negative conditional effect from the 1st URL at the 1st position. For the 3rd URL
at the 3rd position and the 4th URL at the 4th position, we observed a similar conditional effect.

The apparent positional biases shown in Figure [2] will be taken into account below in devising
our re-ranking algorithm. Moreover, the fact that the four lines other than the "Control" do not 
cross each other shows that, on average, the original ranking is doing a decent job in ranking the 
URLs also with respect to CTRs. However, in this paper, we show that we can do better than this 
by re-ranking the URLs appropriately so that the overall CTRs of re-ranked results can be further 
improved. 

¹ To protect business-sensitive information, the paper only reports the normalized CTR, or nCTR, which is the
CTR multiplied by a constant. We will use the two terms interchangeably when there is no confusion.

Figure 2: Marginal nCTRs of the four types of URL at the top four positions in the exploration bucket,
along with the nCTRs in the original order as control. All are on a logarithmic scale (x-axis: position).

3 Motivation 

Before going into technical details, let us first look at a concrete example found in the exploration
bucket to illustrate our motivation, and then summarize the challenges that we confront in recency
search. This example will be revisited in our discussion of experimental results.

As in [?, Section 4], recency queries are defined to be time-sensitive queries that show
non-stationary temporal statistics compared to past query logs. The query "giant squid in California"
is a typical recency query, which appeared on February 1, 2010 and then disappeared after
two days in our exploration bucket data. Figure [3] shows the impression statistics of the query with
respect to time.² This query is related to local news in California. At that time, a number of giant
squids weighing up to 60 pounds had swum into waters off the Californian coast and were caught
by sport fishermen by the hundreds. To find material related to this local news, many people
submitted the query "giant squid in California" to search engines.

To study user click patterns on the URLs associated with the query, we again examined the 
top four URLs in the exploration bucket. The four URLs were retrieved by the default ranking 
function for recency queries in the search engine, which was trained on editorial judgments in batch 
mode. The ranking generated by the search engine on the four URLs was: 

1. foxnews.com/story/0,2933,290667,00.html

2. en.wikipedia.org/wiki/Giant_Squid_(band)

3. metroactive.com/metro/03.29.06/squid-0613.html

4. youtube.com/watch?v=I3ENZDFkAow

² To avoid revealing business-sensitive data, we normalize the query submission number by multiplying it by a
positive number.

Figure 3: Impression count in the exploration bucket of the query "giant squid in California".

Based on their contents, the four web pages can be categorized as "a news story page", "a
background knowledge page", "a relevant page", and "a video page". As we randomly shuffled the
display order of the top 4 URLs in the exploration bucket, each URL had the same chance to be
displayed at each position. Therefore, position bias on clicks for these URLs was removed. The
nCTRs of the four URLs observed in our exploration bucket session data are presented in Figure
[4]. Clearly, although the initial nCTRs were similar, the "video" content ended up receiving the most
clicks, while the "news story" was the runner-up. This shows that while our recency ranking function
made a reasonable decision (by putting the "video" content within the top 4), it nevertheless failed to accurately
predict the ranking with respect to the users' preference reflected in the CTR patterns on the URLs.
Moreover, we note that unless we actually see these click patterns, it would be extremely hard for
human editors to predict such relevance patterns.

Based on the above observations, two challenges are identified for the recency ranking problem. 

• Relevance Drifting: As illustrated by the case above, document relevance may vary significantly
over time. The above example shows the limitations of the editorial-judgment-based,
batch learning framework in tracking such temporal dynamics. In general, it is very difficult
to design features that can correctly reflect the temporal variation of relevance and for editors
to predict the relevance labels before observing actual user behavior. How to rapidly
track such drift is a major challenge of recency ranking.

• Data Sparsity: For the reason given in Section [2], many recency queries have few
impressions. Hence, learning across queries (i.e., generalization) is important.

In addition, we note that keeping track of dynamic content for recency queries to generate 
reasonably good top documents is also a critical challenge, but it is outside the scope of the paper.
In the present work, we rely on the baseline search engine to retrieve the most relevant documents, 
and focus on improved re-ranking using click feedback. 
Figure 4: Hourly CTR of the top 4 URLs associated with the recency query "giant squid in
California". After the 30th hour, no more clicks on this query were observed in the exploration
bucket.

4 Our method 

To address the two challenges in the previous section, we believe that it is helpful for a ranking
module to detect and track the non-stationarity of user interests reflected in the click patterns. The
example in Section [3] suggests that, when positional biases are removed, CTRs can be important
indicators of the relevance of URLs, especially for recency queries. Moreover, as shown in Figure [2], it is
also important to remove positional biases and conditional effects in CTR estimates for documents.
To this end, we assume that the CTR at the top position, denoted by CTR@1, has minimal conditional
effect, and thus we use CTR@1 as a proxy for the relevance of a URL for learning and evaluating our
method.

Ideally, if we knew the true CTR@1 of every document for a query, we could display the 
document with the highest CTR@1 at the top position. However, keeping track of CTR@1 of all 
query-document pairs is unnecessarily difficult: similar to general Web search, it is more important 
to get better estimates for high-quality documents for queries. Therefore, assuming that our original 
ranking function retrieves reasonably good quality documents at the top, we propose a re-ranking 
approach that estimates CTR@1 of top-ranked documents retrieved by the current search engine
and then optimizes the ranking order by CTR@1 estimates. Our approach is designed specifically 
to address the challenges mentioned in Section [3] for recency queries: 

• Relevance Drifting: In contrast to waiting for editorial judgments to update ranking results 
for a recency query, our algorithm updates CTR estimates near real-time based on user click 
feedback on the ranked list of documents. Not only does it avoid the expensive editorial 
judgments, but it can also quickly adapt to the varying relevance and then maximize CTR. 

• Data Sparsity: Our algorithm works in a common feature space shared by all query-document
pairs. It is then able to generalize click feedback of a pair to other pairs via feature values.
Furthermore, we will also show the benefits of maintaining bias terms, or latent features,
dedicated to popular query-document pairs in the experimental results.

One may question the appropriateness of using CTR@1 as a target for ranking problems. However,
this choice is justified, particularly for the recency ranking problem, for the following reasons.
First, CTR@1 is already an important metric that is considered when deploying general
machine-learned ranking functions for web search in practice. Second, our example in Section [3] shows that
CTR@1 can be a more objective metric than editorial labels for recency queries, for which
relevance judgments are difficult to obtain. Third, our approach does not completely ignore editorial
judgments, because only the top portion of the ranking result list is refined. Since the baseline ranking
list is obtained based on editorial labels, traditional relevance-oriented metrics like NDCG
(Normalized Discounted Cumulative Gain) [16] are already reasonably high. Finally, it is worth noting
that our approach can also be used to maximize other metrics of interest such as session length or
revenue. With this reasoning, we now describe the setting and algorithm of our approach in more
detail.

4.1 Settings 

We consider the following re-ranking framework, naturally modeled as a round-by-round process: 
at round t, 

1. A user arrives and types in a query $q_t$.

2. The default recency ranking function generates an ordered list of $s$ documents with the highest
relevance scores. Then, our re-ranking function re-orders these $s$ documents and presents to
the user the re-ordered ranked list $\{u_{t,1}, \ldots, u_{t,s}\}$.

3. The user then provides feedback $r_t = \{c_{t,1}, \ldots, c_{t,s}\}$ on our re-ranking result, where $c_{t,i} = 1$
if the user clicked on the document at position $i$, and $0$ otherwise.

4. Based on the user feedback $r_t$, the re-ranking function is updated and is used for the next
round $t + 1$.

From the process described above, we see that our re-ranking function is inherently an online algorithm
that updates its logic on the fly from the sequential observation of click feedback. In order to
efficiently implement and update our re-ranking function, we use a common feature vector
for every query-document pair, $(q, u)$, and denote it as $\mathbf{x}_{qu} \in \mathbb{R}^d$. In our experiments, a total of $d = 51$
features were used. These features include regular query-specific (e.g., number of words in a query),
document-specific (e.g., spam classification score of a document), and query-document-specific (e.g.,
number of times a query appears in a given document) features used in ordinary machine-learned
ranking functions, and more importantly, the ranking score generated by the default,
editorial-judgment-based recency ranking function. Our re-ranking function is then defined to be a function
that predicts the CTR@1 of each $(q, u)$ as a function of $\mathbf{x}_{qu}$ and possibly of some latent features,
and the function gets updated based on observing users' click feedback in an online fashion. The
detailed functional form and update formula are described in the subsequent two sections.

Given our goal of maximizing CTR@1 in the re-ranking results, it is tempting for an online
algorithm to follow a greedy strategy: that is, it always ranks (for the present query at hand)
the documents in descending order of their CTR@1 estimates and updates the function parameters
solely based on the user feedback for the algorithm's ordered list. While this greedy approach
is intuitively desirable, it can be detrimental in practice. This is because, as can be seen in the
interactive round-by-round process described above, the re-ranking algorithm obtains user feedback
only from the orderings that it has displayed to the user. Therefore, if an algorithm mistakenly
orders the documents, a greedy re-ranking strategy can prevent it from collecting user feedback for
other (potentially better) rankings and correcting its mistake to place the most relevant document at
the top for maximizing CTR@1. Consequently, the algorithm has to balance two conflicting goals:
(a) "exploitation": to display in the first position the most relevant documents to maximize re-ranking
quality (in our case, to maximize user clicks), and (b) "exploration": to display documents for the
purpose of collecting data for further improvement. The exploration/exploitation tradeoff described
above is a defining characteristic of a class of problems known as bandit problems [27], which have
received considerable attention recently for Internet-related applications [25].

In the following Section [4.2] and Section [4.3], we present the basic functional form of our online
re-ranking function and its update formula given the user feedback $r_t$, respectively.
Then, in Section [4.4], we describe how we vary our scheme in order to cope with the explore/exploit
tradeoff explained above and accelerate learning speed.

4.2 Parametric CTR@1 Estimate 

Although many alternatives exist, we choose our re-ranking function to be linear in the feature
vector $\mathbf{x}_{qu}$. This choice allows us to derive exact update rules and simplify the exposition. Other
non-linear models may also be used, although numerical approximation is unavoidable in general
when optimizing their model parameters. In particular, we have tried logistic regression and
probit-based regression [13], and observed performance similar to that of the linear model.

Since we try to maximize CTR@1, it is natural to find a function that estimates CTR@1 of a
$(q, u)$ pair for re-ranking. Once the feature vector $\mathbf{x}_{qu}$ of length $d$ is given for a $(q, u)$ pair, a linear
combination of its components is used to estimate CTR@1. In fact, we will use the most general form that
captures all variants useful in our experiments:

$$\mathrm{CTR@1}(q,u) = \boldsymbol{\beta}^\top \mathbf{x}_{qu} + b_{q,u}, \tag{1}$$

where the vector $\boldsymbol{\beta} \in \mathbb{R}^d$ contains the coefficients shared by all query-document pairs, and $b_{q,u} \in \mathbb{R}$
is a $(q, u)$-specific bias term. Both $\boldsymbol{\beta}$ and $\{b_{q,u}\}$ are to be learned by our algorithm.

Clearly, user click feedback on any query-document pair may be used to estimate $\boldsymbol{\beta}$, which in
turn can be used to predict CTR@1 for other query-document pairs. Therefore, the linear part of
Equation [1] addresses the data sparsity challenge by allowing generalization across different
queries and documents. However, a linear model in the features may not be sufficiently accurate
to capture the real CTR@1. The bias terms thus provide a mechanism to correct the residuals and
to yield more accurate estimates. Due to these bias terms, it may appear that Equation [1] uses too
many free parameters. However, as will become clear in the next subsection, we use regularization to
control the magnitude of these terms, so the bias terms will be essentially zero except for popular
query-document pairs. Consequently, these terms can be used to yield a highly accurate CTR@1
estimate for popular $(q, u)$, while for unpopular $(q, u)$ (which suffer from the data sparsity issue most)
we essentially use the linear estimate $\boldsymbol{\beta}^\top \mathbf{x}_{qu}$. Such a dichotomy arises automatically within the
regularization framework.
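
To make the form of Equation [1] concrete, the following minimal sketch scores candidates with a shared coefficient vector plus an optional per-pair bias and sorts them by the estimate. The function names `ctr_at_1` and `rerank`, and the dictionary-based storage of bias terms, are illustrative assumptions rather than the authors' implementation.

```python
def ctr_at_1(beta, x, bias=0.0):
    """Linear CTR@1 estimate: beta^T x plus a pair-specific bias."""
    return sum(b * v for b, v in zip(beta, x)) + bias


def rerank(candidates, beta, biases):
    """Order candidates by descending estimated CTR@1.

    `candidates` is a list of (query, url, feature_vector) tuples and
    `biases` maps (query, url) to its bias term; pairs without an entry
    fall back to the purely linear estimate (bias 0).
    """
    def score(candidate):
        query, url, x = candidate
        return ctr_at_1(beta, x, biases.get((query, url), 0.0))

    return sorted(candidates, key=score, reverse=True)
```

In this sketch an unpopular pair simply has no entry in `biases`, which mirrors the dichotomy described above: its estimate reduces to the linear part alone.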

4.3 Parameter Update Rule 

This subsection addresses the problem of parameter updates for model (1). We first describe how
to fit the parameters if we are given a static set of data, then extend the update rule to the online 
case when data arrive sequentially, and finally discuss a few practical issues when deploying the 
update rules in large-scale ranking systems. 

4.3.1 Batch Parameter Fitting 

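Suppose we are given a set $\mathcal{D}$ of $t$ data in the form of $\{(q_i, u_i, c_i)\}_{i=1,2,\ldots,t}$, where $c_i \in \{0, 1\}$ is the
click feedback for $(q_i, u_i)$ provided by the $i$-th user. Let $\mathcal{V}$ be the set of distinct $(q, u)$ pairs observed
in $\mathcal{D}$, and $N = |\mathcal{V}|$. For brevity, denote the feature vector for the $(q_i, u_i)$ pair as $\mathbf{x}_i$.

A standard approach to learn the parameters in Equation [1] for CTR@1 estimation is ridge
regression using the $\{c_i\}$'s as targets: we seek the optimal parameters that minimize a regularized
square loss:

$$L(\boldsymbol{\beta}, \{b_{q,u}\}) = \sum_{i=1}^{t} \left(c_i - \boldsymbol{\beta}^\top \mathbf{x}_i - b_{q_i,u_i}\right)^2 + \lambda_1 \left\|\boldsymbol{\beta} - \boldsymbol{\beta}^{(0)}\right\|_2^2 + \lambda_2 \sum_{(q,u) \in \mathcal{V}} \left(b_{q,u} - b_{q,u}^{(0)}\right)^2, \tag{2}$$

where $\lambda_1$ and $\lambda_2$ are positive regularization parameters provided by users, $\boldsymbol{\beta}^{(0)}$ and $b_{q,u}^{(0)}$ are the prior
values, and $\|\cdot\|_2$ is the ordinary $\ell_2$-norm. Here, regularization is applied to avoid over-fitting and
to ensure numerical stability.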

Since (2) is a least-squares problem with $d + N$ parameters, one may think it is intractable
to solve for the exact solution, since the computational complexity is $O((d + N)^3)$, which is often very
large. Fortunately, using matrix algebra, we can derive a closed-form solution for the minimizer of
(2), whose complexity is cubic in $d$ and only linear in $N$.

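Specifically, we partition the index set $\{1, 2, \ldots, t\}$ into $I_1 \cup I_2 \cup \cdots \cup I_N$, so that $I_j$ contains the
indices in $\mathcal{D}$ that correspond to the $j$-th distinct $(q, u)$ pair. For every $j \in \{1, 2, \ldots, N\}$, we define
the following quantities,

$$a_j \stackrel{\text{def}}{=} \lambda_2 + |I_j|, \qquad \mathbf{h}_j \stackrel{\text{def}}{=} \sum_{i \in I_j} \mathbf{x}_i, \qquad d_j \stackrel{\text{def}}{=} \lambda_2 b_j^{(0)} + \sum_{i \in I_j} c_i.$$

In addition, we define

$$\mathbf{A}_0 \stackrel{\text{def}}{=} \lambda_1 \mathbf{I} + \sum_{i=1}^{t} \mathbf{x}_i \mathbf{x}_i^\top, \qquad \mathbf{d}_0 \stackrel{\text{def}}{=} \lambda_1 \boldsymbol{\beta}^{(0)} + \sum_{i=1}^{t} c_i \mathbf{x}_i.$$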

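Now the optimal solution to the least-squares problem must satisfy the first-order optimality
condition:

$$\frac{\partial L}{\partial \boldsymbol{\beta}} = \mathbf{0}, \qquad \frac{\partial L}{\partial b_j} = 0,$$

for each $j \in \{1, \ldots, N\}$. Solving this system of linear equations immediately gives the regularized
least-squares solution:

$$\boldsymbol{\beta}^* = \Big(\mathbf{A}_0 - \sum_{j=1}^{N} a_j^{-1} \mathbf{h}_j \mathbf{h}_j^\top\Big)^{-1} \Big(\mathbf{d}_0 - \sum_{j=1}^{N} a_j^{-1} d_j \mathbf{h}_j\Big), \tag{3}$$

$$b_j^* = a_j^{-1} \left[d_j - \mathbf{h}_j^\top \boldsymbol{\beta}^*\right], \quad \text{for each } j \in \{1, 2, \ldots, N\}. \tag{4}$$

In other words, the complexity of solving the least-squares problem now becomes $O(d^3 + dN)$, a
substantial improvement over the $O((d + N)^3)$ complexity of the naive approach.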
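
The closed-form solution can be sketched in plain Python as follows. This is an illustrative reading of Equations (3) and (4), assuming that $\mathbf{h}_j$ is the sum of the feature vectors of the examples belonging to the $j$-th pair and that missing priors default to zero; the helper names `solve` and `fit_batch` are hypothetical, and a production system would use a numerical linear-algebra library instead of the small Gaussian-elimination routine shown here.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_batch(data, d, lam1, lam2, beta0=None, bias0=None):
    """Fit (beta, biases) from a list of (pair_key, x, c) examples."""
    beta0 = beta0 or [0.0] * d
    bias0 = bias0 or {}
    A0 = [[lam1 if r == k else 0.0 for k in range(d)] for r in range(d)]
    d0 = [lam1 * v for v in beta0]
    a, h, dj = {}, {}, {}
    for key, x, c in data:
        if key not in a:  # first time this (q, u) pair is seen
            a[key] = lam2
            h[key] = [0.0] * d
            dj[key] = lam2 * bias0.get(key, 0.0)
        a[key] += 1
        dj[key] += c
        for r in range(d):
            h[key][r] += x[r]
            d0[r] += c * x[r]
            for k in range(d):
                A0[r][k] += x[r] * x[k]
    # Equation (3): a d-by-d system after eliminating the bias terms.
    M = [row[:] for row in A0]
    rhs = d0[:]
    for key in a:
        for r in range(d):
            rhs[r] -= dj[key] * h[key][r] / a[key]
            for k in range(d):
                M[r][k] -= h[key][r] * h[key][k] / a[key]
    beta = solve(M, rhs)
    # Equation (4): each bias follows independently once beta is known.
    biases = {
        key: (dj[key] - sum(h[key][r] * beta[r] for r in range(d))) / a[key]
        for key in a
    }
    return beta, biases
```

Only a $d \times d$ system is solved, regardless of how many distinct pairs appear in the data, which is the source of the complexity reduction discussed above.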
4.3.2 Online Parameter Updates

More importantly, the formulas above suggest that we only need to maintain a set of sufficient
statistics ($\mathbf{A}_0$, $\mathbf{d}_0$, $a_j$, $\mathbf{h}_j$, and $d_j$) to obtain the exact solution when a new data point is added to the set
$\mathcal{D}$, without the need to recompute all quantities.

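When a new example $(q_{t+1}, u_{t+1}, c_{t+1})$ arrives, all these sufficient statistics can be updated
efficiently in an incremental fashion. In particular, let $j_{t+1}$ be the index of $(q_{t+1}, u_{t+1})$ in $\mathcal{V}$; then
$O(d^2)$ time is needed for the updates:

$$\mathbf{A}_0 \leftarrow \mathbf{A}_0 + \mathbf{x}_{t+1} \mathbf{x}_{t+1}^\top, \qquad \mathbf{d}_0 \leftarrow \mathbf{d}_0 + c_{t+1} \mathbf{x}_{t+1},$$

$$a_{j_{t+1}} \leftarrow a_{j_{t+1}} + 1, \qquad \mathbf{h}_{j_{t+1}} \leftarrow \mathbf{h}_{j_{t+1}} + \mathbf{x}_{t+1}, \qquad d_{j_{t+1}} \leftarrow d_{j_{t+1}} + c_{t+1}.$$

With these updated sufficient statistics, we can now apply Equations [3] and [4] to compute the
exact solutions, which again requires $O(d^3 + dN)$ time. However, there are acceleration techniques
that can reduce the complexity to $O(d^2)$ and even $O(d)$, as explained in the next subsection.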
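
A minimal sketch of the incremental bookkeeping is given below, assuming each per-pair statistic starts from its regularization prior the first time the pair is seen. The class name `SufficientStats` and its attribute names are hypothetical.

```python
class SufficientStats:
    """Incrementally maintained statistics (A0, d0, a_j, h_j, d_j)."""

    def __init__(self, d, lam1, lam2, beta0=None):
        beta0 = beta0 or [0.0] * d
        self.d = d
        self.lam2 = lam2
        self.A0 = [[lam1 if r == k else 0.0 for k in range(d)] for r in range(d)]
        self.d0 = [lam1 * v for v in beta0]
        self.a, self.h, self.dj = {}, {}, {}

    def update(self, key, x, c, bias0=0.0):
        """Absorb one example (pair key, features x, click c) in O(d^2) time."""
        if key not in self.a:  # a previously unseen (q, u) pair
            self.a[key] = self.lam2
            self.h[key] = [0.0] * self.d
            self.dj[key] = self.lam2 * bias0
        self.a[key] += 1
        self.dj[key] += c
        for r in range(self.d):
            self.h[key][r] += x[r]
            self.d0[r] += c * x[r]
            for k in range(self.d):
                self.A0[r][k] += x[r] * x[k]  # rank-one term x x^T
```

The dominant cost per example is the rank-one update of `A0`; all per-pair quantities are touched only for the pair that was just observed.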

4.3.3 Implementation Issues in Practice 

While the update rules derived above are reasonably efficient, we would still like greater acceleration
for large $d$ and large $N$, so that the response time of the whole re-ranking system can be further
reduced.

First of all, the most time-consuming part is the inversion of the matrix in Equation [3], which
takes $O(d^3)$ time. Fortunately, since every new data point results in a rank-one update on the matrix
$\mathbf{A}_0 - \sum_{j=1}^{N} a_j^{-1} \mathbf{h}_j \mathbf{h}_j^\top$, a straightforward variant of the famous Sherman-Morrison formula may be
applied to reduce the complexity to $O(d^2)$.
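
For reference, the standard Sherman-Morrison identity states that $(\mathbf{M} + \mathbf{u}\mathbf{v}^\top)^{-1} = \mathbf{M}^{-1} - \mathbf{M}^{-1}\mathbf{u}\mathbf{v}^\top\mathbf{M}^{-1} / (1 + \mathbf{v}^\top\mathbf{M}^{-1}\mathbf{u})$. The sketch below implements this generic identity only; the specific variant used by the authors for the matrix in Equation (3) is not spelled out in the text, and the function name `sherman_morrison` is an assumption.

```python
def sherman_morrison(M_inv, u, v):
    """Return the inverse of (M + u v^T), given M_inv = M^{-1}, in O(d^2)."""
    d = len(u)
    Mu = [sum(M_inv[r][k] * u[k] for k in range(d)) for r in range(d)]  # M^{-1} u
    vM = [sum(v[r] * M_inv[r][k] for r in range(d)) for k in range(d)]  # v^T M^{-1}
    denom = 1.0 + sum(v[r] * Mu[r] for r in range(d))
    if abs(denom) < 1e-12:
        raise ValueError("rank-one update makes the matrix singular")
    return [
        [M_inv[r][k] - Mu[r] * vM[k] / denom for k in range(d)]
        for r in range(d)
    ]
```

Keeping the inverse itself as the maintained state, rather than the matrix, is what avoids the cubic-time inversion at every round.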

Second, we may also ignore off-diagonal elements of the matrix $\mathbf{A}_0 - \sum_{j=1}^{N} a_j^{-1} \mathbf{h}_j \mathbf{h}_j^\top$, so that
inversion can be done very efficiently in $O(d)$ time. According to our experience (not reported in
the present paper), this approximation is quite effective, yielding a good tradeoff between solution
quality and time requirement.

Third, we note that it is unnecessary to update all bias terms $b_j^*$ every time a new example
arrives. In fact, these bias terms can be updated independently, provided that $\boldsymbol{\beta}^*$ is given.
Therefore, we may delay their updates until the moment they are used. Specifically, for a new
example $(q_{t+1}, u_{t+1}, c_{t+1})$, we may only update $b_{j_{t+1}}^*$, where $j_{t+1}$ is the index of $(q_{t+1}, u_{t+1})$ in $\mathcal{V}$.
This lazy-update trick completely removes the time dependency on $N$, a significant improvement
when $N$ is large.

Finally, we note that it may even be impossible and unnecessary to explicitly maintain a bias
term for every $(q, u)$ pair, since only a small fraction of them are popular queries and thus are
expected to take advantage of those bias terms. A few techniques such as the hashing trick [20]
may be used to limit the effective number of bias terms.
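
One way the hashing idea could look in practice is sketched below: bias terms live in a fixed-size table indexed by a hash of the query-document pair, so memory stays bounded no matter how many pairs are seen. The class name `HashedBiases`, the use of CRC32 as the hash, and the tab-joined key are assumptions of this sketch and are not taken from [20] or from the paper.

```python
import zlib


class HashedBiases:
    """Fixed-size table of bias terms indexed by a hash of (query, url)."""

    def __init__(self, num_buckets):
        self.values = [0.0] * num_buckets

    def _bucket(self, query, url):
        key = f"{query}\t{url}".encode("utf-8")
        return zlib.crc32(key) % len(self.values)

    def get(self, query, url):
        return self.values[self._bucket(query, url)]

    def add(self, query, url, delta):
        # Pairs that collide share one bias term; with enough buckets this
        # mostly affects rare pairs whose bias is close to zero anyway.
        self.values[self._bucket(query, url)] += delta
```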

4.4 Variations of model 

Given the model form and update formula in Section [4.2] and Section [4.3], there are a couple of
choices to try for our online re-ranking function, which we describe below.

Exploration and $\epsilon$-greedy: In Section [4.3], we did not describe how the click feedback for the
data $\mathcal{D}$ is collected. In order to explore rankings other than the output of our re-ranking model
and collect balanced click feedback in our data set $\mathcal{D}$, we use the $\epsilon$-greedy strategy. $\epsilon$-greedy is
a simple strategy to handle the explore-exploit tradeoff described in Section [4.1]. It collects the
feedback from a randomly permuted ranking with probability $\epsilon$ and from the result re-ranked
by the function (1) with probability $1 - \epsilon$. Thus, by controlling $\epsilon$, we can balance exploration
and exploitation in our online learning, and our exploration bucket data enables us to realize this
strategy. More details on the methodology of using our exploration bucket data are described in
the next section.
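
The $\epsilon$-greedy choice between a random permutation and the model's ordering can be sketched as follows. The function name `epsilon_greedy_order` and its signature are illustrative assumptions; in the paper the random orderings come from the logged exploration bucket rather than being generated live.

```python
import random


def epsilon_greedy_order(docs, scores, epsilon, rng):
    """Pick the ordering to display for one round.

    With probability `epsilon`, return a uniformly random permutation of
    `docs` (exploration); otherwise sort by descending estimated CTR@1
    (exploitation). Returns (ordering, explored_flag).
    """
    if rng.random() < epsilon:
        order = list(docs)
        rng.shuffle(order)
        return order, True
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked], False
```

Setting `epsilon` to 0 recovers the purely greedy strategy discussed in Section [4.1], while `epsilon` equal to 1 corresponds to pure exploration.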

Warm start: When we are sequentially learning $(\boldsymbol{\beta}, \{b_{qu}\})$ as described in Section [4.3], we need
not learn them from scratch solely based on online learning (which is known as cold-start), but
can learn a starting point from some already available click logs (i.e., warm-start). The effectiveness
of such warm-start models could be critical in terms of improving the performance and learning
speed of our re-ranking function, as presented in the next section.

Using clicks on multiple positions: In Section 4.3 we implicitly assumed that the click feedback
$\{c_i\}$ is the feedback received when the document was displayed in the first position
for the query, since we used it as the target for our CTR@1 function in (1). However, although
our goal is to maximize CTR@1, we need not limit ourselves to the click feedback on position 1,
that is, $c_{t,1}$ in the $c_t = \{c_{t,1}, \dots, c_{t,s}\}$ defined in Section 4.1; we can use clicks on multiple
positions for learning our re-ranking function. In that case, we enlarge the data set $\mathcal{D}$ to
$\{(q_i, u_{i,p}, c_{i,p})\}_{i=1,\dots,t;\; p=1,\dots,s}$ and introduce additional bias terms $\{b_p\}_{p=1}^{s}$ to correct the positional
biases in the click feedback on position $p$. Then, we model CTR@p as

$$\mathrm{CTR@}p(q,u) = \mathrm{CTR@1}(q,u) + b_p \qquad (5)$$

while modifying the loss function to

$$L(\beta, \{b_{qu}\}, \{b_p\}) = \sum_{i=1}^{t} \sum_{p=1}^{s} \left( c_{i,p} - \beta^{\top} x_{q_i u_{i,p}} - b_{q_i u_{i,p}} - b_p \right)^2 + \lambda_1 \|\beta - \beta^{(0)}\|^2 + \lambda_2 \sum_{(q,u) \in \mathcal{D}_t} \left( b_{qu} - b_{qu}^{(0)} \right)^2 + \lambda_3 \sum_{p=2}^{s} \left( b_p - b_p^{(0)} \right)^2 \qquad (6)$$

where $\lambda_3$ and $b_p^{(0)}$ are the regularization coefficient and prior for the positional bias terms $\{b_p\}$. Note
that we set $b_p = 0$ when $p = 1$, and our re-ranking function is still $\mathrm{CTR@1}(q,u) = \beta^{\top} x_{qu} + b_{qu}$,
learned by minimizing (6). In this way, we can utilize more click feedback than only the
clicks on position 1 to learn the re-ranking function (1). In Section 5 we will show how useful this
approach is for building the warm-start model described above. For the online updates, however,
in order to control the number of experiments to compare, we continue to use only the clicks at
the first position and use the loss function and update formula in Section 4.3 for all of the online
schemes in our experiments.
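
To make the structure of the multi-position loss concrete, the following sketch evaluates a squared loss with additive position biases and quadratic pulls toward priors. It is an illustration under stated assumptions rather than the authors' code: the record layout, the feature vector `x`, and every name (`multi_position_loss`, `b_pos`, `lam1`, and so on) are hypothetical.

```python
def multi_position_loss(records, beta, b_qu, b_pos,
                        beta0, b_qu0, b_pos0, lam1, lam2, lam3):
    """Regularized squared loss with an additive bias per display position.

    records: iterable of (q, u, p, x, c) = query id, document id, 1-based
             position, feature vector (list of floats), click (0 or 1).
    b_qu / b_qu0: dict (q, u) -> bias / prior.
    b_pos / b_pos0: dict position -> bias / prior; position 1 is fixed to 0.
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    fit = 0.0
    seen_pairs = set()
    for q, u, p, x, c in records:
        bias_p = 0.0 if p == 1 else b_pos.get(p, 0.0)
        pred = dot(beta, x) + b_qu.get((q, u), 0.0) + bias_p
        fit += (c - pred) ** 2
        seen_pairs.add((q, u))

    # Quadratic penalties pulling each parameter group toward its prior.
    reg_beta = lam1 * sum((w - w0) ** 2 for w, w0 in zip(beta, beta0))
    reg_qu = lam2 * sum((b_qu.get(k, 0.0) - b_qu0.get(k, 0.0)) ** 2
                        for k in seen_pairs)
    reg_pos = lam3 * sum((b - b_pos0.get(p, 0.0)) ** 2
                         for p, b in b_pos.items() if p != 1)
    return fit + reg_beta + reg_qu + reg_pos
```

Restricting `records` to position 1 and dropping the position term yields a first-position-only loss of the kind used for the online schemes.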

5 Experimental results 

This section reports our experiments on various algorithms for recency search re-ranking using the
exploration bucket data described in Section 2. Section 5.1 describes the unbiased offline evaluation
method we adopt in the experiments. Section 5.2 describes a number of representative algorithms
for comparison. These algorithms are selected to demonstrate the benefits of the various algorithmic
choices described in Section 4.4. Section 5.3 presents and analyzes the experimental results in detail.
Finally, Section 5.4 revisits the query examined in Section 3, illustrating how our algorithm adapts
to user click feedback to re-rank the top documents and yield better results.

5.1 Unbiased Offline Evaluation 

A tricky part of our problem is that, unlike in supervised learning, it is hard to evaluate and compare
the performance of algorithms using a static set of log data. The reason is that the click feedback in the
log depends on the ranking results that the user observed when the log was collected; consequently,
we do not know what that user might have clicked if the algorithm we evaluate had ranked the results
differently. Fortunately, our exploration bucket data can be used for reliable offline evaluation of
different algorithms, including both batch and online ones.

We follow the "replaying" evaluation method studied by [22] for interactive applications like
the re-ranking problem considered here. First, we hold out the sessions of the last three days
in the exploration bucket data and use them as a test set. The first three days' data may be used as
a training set for batch learning or for the warm-start model described in Section 4.4. We then sort the
test sessions in order of their time stamps. To evaluate an algorithm's CTR@1 on the test set, we
maintain two quantities, C and M, which are interpreted as the number of clicks at position 1 and
the number of search sessions, respectively. Both C and M are initialized to 0.

1. We retrieve the t-th session in the test set and present the top s documents together with their
features to the re-ranking function.

2. The re-ranking algorithm then proposes to display one of the documents in the first position
based on its re-ranking scores. We call it a "match" if this proposed document is the same
as the one displayed in the first position in the retrieved test session.

3. If a match happens, we reveal the user feedback $c_t$ (1 for click and 0 otherwise) to the
algorithm, and perform the updates $C \leftarrow C + c_t$ and $M \leftarrow M + 1$.

4. Otherwise, $c_t$ is not revealed, and the values of C and M are unchanged. Effectively, this
session is ignored.

Finally, the overall CTR@1 of the algorithm in the evaluation process above is C/M. 
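
The four steps above can be sketched as a short replay loop. This is a minimal sketch under assumptions: the session tuple layout and the names `replay_ctr_at_1`, `propose_top`, and `update` are hypothetical and chosen only for illustration.

```python
def replay_ctr_at_1(sessions, propose_top, update=None):
    """Estimate CTR@1 by replaying logged sessions whose top-s order was random.

    sessions: iterable of (docs, shown_first, click), sorted by time stamp;
        docs is the list of top-s candidates, shown_first is the document
        logged in position 1, and click is 1 or 0 for that position.
    propose_top: callable(docs) -> document the algorithm would place first.
    update: optional callable(docs, shown_first, click), invoked on a match.
    """
    clicks, matches = 0, 0                      # the quantities C and M
    for docs, shown_first, click in sessions:
        if propose_top(docs) == shown_first:    # a "match"
            clicks += click
            matches += 1
            if update is not None:
                update(docs, shown_first, click)
        # otherwise the feedback stays hidden and the session is ignored
    ctr = clicks / matches if matches else float("nan")
    return ctr, clicks, matches
```

For example, an algorithm that always proposes document "a" is scored only on the logged sessions in which "a" happened to be shown first.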

For each session in our test set, the probability that a match happens is 1/s for any ranking
algorithm, since the top s documents are randomly shuffled in our exploration bucket data. Therefore,
for a test set of L sessions, M equals L/s on average. In our experiments, since L is large,
M is almost constant across different runs. The following key property justifies the soundness of
the evaluation method above: it can be proved that the estimated CTR@1, C/M, of an online
algorithm is an unbiased estimate of its true CTR@1, as if we were able to run it to serve live
user traffic [22, Theorem 2]. Therefore, algorithms that have higher CTR@1 estimates under this
evaluation method are expected to have higher CTR@1 in live buckets as well. This important fact allows us
to reliably compare and evaluate various algorithms offline without the costs and risks of actually
testing them with live users.

5.2 Models 

There are various options to leverage user click feedback to adjust a re-ranking function. For
instance, one may expect better adaptation to user interest if a re-ranking system can adjust its
ranking function in real time based on user feedback; one may also be interested in understanding
how re-ranking performance is affected by the CTR model, such as the ability to generalize (via
the linear features) and specialize (via the bias terms) in our model (1).

Below, we describe a few representatives, chosen carefully to demonstrate the benefits of various 
algorithmic choices. The methods are grouped into four categories. 

1. The first is a baseline that is based entirely on editorial judgments and does not leverage user 
clicks at all: 

• frmsc(baseline): We used the recency ranking function [8] deployed in our search
engine as a baseline. This function was trained using time-varying recency features and
recency-demoted labels provided by human editors. This method does not use click
feedback.
2. The second category contains methods that learn CTR@1 from the first three days' training
data and are not updated online in the test phase. That is, the data $\mathcal{D}$ used in the update
formula of Section 4.3 consists only of the clicks in the first position in the training set. Such
methods will be compared to their online-learning counterparts.

• batch(b): This is the linear model in (1) trained on the training set, and then deployed
on the test set without any online updates. Note that there are no positional biases in
the click feedback in the training set for this model, thanks to the exploration bucket.

• batch(nb): This is the same as above, but does not use the bias terms in (1). In other
words, only a linear combination of features is used to compute a CTR@1 estimate. This
model is used to show the benefits of the bias terms.



3. The third category contains the online learning methods of Section 4.3 for re-ranking with the
$\epsilon$-greedy strategy mentioned in Section 4.4. We realize the $\epsilon$-greedy strategy in our online
learning methods by utilizing the exploration bucket data again. That is, while we use the
exploration bucket data for an unbiased evaluation of the various schemes as in
Section 5.1, we use the data once more to incrementally train the online schemes, as in Section
4.3. More concretely, at time t, the click feedback at the first position, $c_{t,1}$, is revealed to the
online schemes for the model updates, regardless of whether there is a "match" for the
schemes, so that the re-ranking function can observe the feedback for all randomly
served documents in the top position and correctly learn the re-ranking based on CTR@1. Note
that this effectively simulates the $\epsilon$-greedy strategy with $\epsilon = 1$ and a separate deployment
test bucket for evaluation. Also, a subtle but important point is that we reveal the click
feedback after the data point has been used for the "replay" evaluation, so that we are not training
and testing on the same data point. Moreover, in practice there is usually a time
delay between the delivery of the ranking result and the receipt of user feedback. To make our
evaluation and online learning process closer to reality, the user feedback $c_{t,1}$ is not
revealed to the re-ranking algorithm immediately. Rather, these signals are revealed every
five minutes (based on the time stamps of the test sessions) in our simulation.

Based on a few algorithmic choices, we tested the following variations to see the effect of the online
schemes.

• online(b) and online(nb): These are the online algorithms that optimize the parameters
in (1) incrementally based on user click feedback, with and without bias terms,
respectively. Note that both algorithms learn the parameters from scratch.

• online(b,ws) and online(nb,ws): These are the same as online(b) and online(nb)
but use warm-start initialization of the parameters. Specifically, we used the batch-learned
parameters (in batch(b) and batch(nb), respectively) as $\beta^{(0)}$ and $b_{qu}^{(0)}$ in (1).
We set $\lambda_1 = \lambda_2 = 10$. These methods thus combine the prior knowledge extracted from
previous data with the ability to learn online.

• online(b,ws,w0): This method is similar to online(b,ws) except that the weight
vector $\beta$ is fixed to the warm-start $\beta^{(0)}$ learned by batch(b). Thus, this model performs
limited online updates and is useful to demonstrate the benefits of online update of $\beta$.

• counting: Motivated by click models, this method maintains the ratio of cumulative
clicks to views for each document-query pair separately. This is an online learning model,
but it does not utilize query-document features for generalization. It may suffer from the
"cold-start" problem on tail queries. Essentially, this scheme is equivalent to maintaining
only the bias terms for observed document-query pairs.
4. The batch learning of (warm-start) parameters in the previous two categories uses the
first three days of exploration bucket data. However, although the exploration bucket gives
the unbiased CTR@1 of the documents as in Section 3, in practice it is not realistic to always
require such expensive data in order to build the batch model as a starting point for online
models. Therefore, we use the controlled log from non-exploration buckets (e.g., the production
bucket) for the same period of time to build the batch models and compare them with the
models built from the exploration bucket. Note that the controlled log is very cheap to obtain
but has large positional biases in clicks, so it is not clear how well the batch models
trained on the controlled logs would perform. We include the following three variations of
the batch model with the controlled log, which contained 485,135 sessions from production
logs that overlap with the first three days of the exploration bucket.¹

• batch(control@1): This method learns the model in (1) using the clicks at the first
position in the controlled log.

• batch(control@4,np): This method learns the model with the loss function in (6),
using the clicks at the top four positions in the controlled log. However, it ignores the position
biases: the $\{b_p\}$'s are all set to zero, so data from the four positions are not distinguished.

• batch(control@4): This method improves on batch(control@4,np) by considering
position biases and including nonzero $\{b_p\}$'s as described in Section 4.4.
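
Of the methods listed above, counting is the simplest to illustrate. The sketch below keeps per-pair click and view counters and uses their ratio as the CTR@1 estimate. The class name `CountingModel`, its methods, and the default score of 0.0 for unseen pairs are assumptions made for this example; the text does not specify them.

```python
from collections import defaultdict

class CountingModel:
    """Per (query, document) click/view counters used as a CTR@1 estimate."""

    def __init__(self):
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def update(self, query, doc, click):
        """Record one first-position view of `doc` for `query` and its click (0/1)."""
        self.views[(query, doc)] += 1
        self.clicks[(query, doc)] += click

    def score(self, query, doc):
        views = self.views.get((query, doc), 0)
        # Unseen pairs get 0.0 here (an assumed default): with no features to
        # generalize from, the estimate is uninformative until views accumulate.
        return self.clicks.get((query, doc), 0) / views if views else 0.0

    def top(self, query, docs):
        """Document with the highest estimated CTR@1 among `docs`."""
        return max(docs, key=lambda d: self.score(query, d))
```

Because every pair starts from the same default, tail queries with few sessions cannot be re-ranked meaningfully, which is the cold-start weakness noted in the description of counting.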

5.3 Experimental Results 

We ran the algorithms described in the previous subsection, whose overall nCTR@1 results are
summarized in Table 1. The lifts over the baseline's nCTR@1 are also computed.



Table 1: Overall cumulative nCTR@1 on the test set.

algorithms             nCTR@1   lift over frmsc
frmsc(baseline)        0.770     0%
batch(b)               0.877    13.90%
batch(nb)              0.849    10.26%
online(b)              0.875    13.64%
online(nb)             0.839     8.96%
online(b,ws)           0.901    17.01%
online(nb,ws)          0.851    10.52%
online(b,ws,w0)        0.891    15.71%
counting               0.872    13.25%
batch(control@1)       0.883    14.68%
batch(control@4,np)    0.856    11.17%
batch(control@4)       0.885    14.94%

To visualize how instantaneous nCTR@1 evolves over time, we also computed aggregated clicks
in every 6-hour period for the algorithms. Only five algorithms are included in Figure 5 to ensure
legibility.

5.3.1 Results Analysis 

A number of important observations are in order based on the results reported in Table 1 and
Figure 5.

¹ Note that the online methods described above may also be combined with warm-start models learned from control
logs. We do not include them in the comparison, to keep the presentation of the results simple.
Figure 5: Instantaneous (normalized) CTR@1 of the four models on the test data. Each point was
measured for an approximately six-hour period.

First, the advantage of leveraging user click feedback is obvious from the lift of all algorithms 
over the baseline that does not use click feedback at all. The click lift, which ranges from 8.96% to 
17.01%, is statistically significant given the size of our data set. Even the batch-learning methods 
that do not perform online updates are quite strong, capable of achieving at least 10% lift. 

Second, we can see the additional benefits brought by the bias terms in both batch and online
algorithms. In all cases, an algorithm performs better than its counterpart without bias
terms. Between online(b,ws) and online(nb,ws), the bias terms account for about
6.49 percentage points of lift.

Third, while batch algorithms achieve quite strong lift, greater lifts are achieved by algorithms
that adjust their re-ranking functions online. This benefit is expected since the online algorithms
are able to extract more information from online click feedback, in addition to the batch-learned
models. It is also worth pointing out a practically important fact: batch-learned models can be used
as warm-start models for online methods. Of all the algorithms in Table 1,
the greatest lift is achieved by online(b,ws), the online method that uses batch-learned warm-start
models and bias terms. Even for the coefficient vector that is shared by all query-document
pairs, updating it online is still helpful, as shown by the gap between online(b,ws) and
online(b,ws,w0). A larger gap is possible if the time span of the test data is longer.

Fourth, we examine the role of generalization in our CTR@1 model. As discussed earlier, the
bias terms are essentially zero except for popular query-document pairs, due to the regularization we
used in the optimization step. Therefore, the linear part in (1) determines the CTR@1 estimates of
tail queries for which we observed one or few sessions. The result in Table 1 confirms our conjecture:
the counting method, which uses the bias terms alone for re-ranking, yielded a lower click lift than
online(b,ws,w0) or online(b,ws). The reason is that the CTR model in counting cannot make
good predictions in "cold-start" situations, and so little lift was achieved on the tail queries.

Finally, our results suggest it is possible to use the control log to build a competitive warm-start
model. It should be noted that the control log has much more data than the exploration
data, hence the strong performance. However, position bias has to be considered when clicks
from multiple positions are used, as demonstrated by the gap between batch(control@4) and
batch(control@4,np). We believe the performance of learning from controlled logs could be
further improved by advanced click models. We plan to investigate this direction in future work.

Figure 6: Relative CTR@1 lift in percentage (%) over the baseline model (frmsc) for queries with
different impressions (x-axis buckets: imp<=2, 2<imp<=10, 10<imp<=100, 100<imp).

5.3.2 Lift Distribution 
Figure 7: Relative CTR@1 lift in percentage (%) over the baseline model (frmsc) for queries with
different lengths (x-axis buckets: 1 word query, 2 word query, 3 word query, 4+ word query).

We now examine the lift distribution over queries with different impression counts and lengths, respectively.
Results of two representatives, batch(b) and online(b,ws), are reported.

Figure 6 presents CTR@1 lifts over frmsc aggregated over queries with different impressions.
We notice that the online(b,ws) model does very well on popular recency queries, whereas
the batch(b) model gives more lift on queries with fewer than 2 impressions. For queries with very
limited impressions, e.g., fewer than 2, the online(b,ws) model cannot gain much advantage over
the batch(b) model. This problem might be mitigated when applying the online(b,ws) model to
larger traffic. This observation also suggests a practical solution, i.e., employing batch(b) models
for queries with scarce impressions while using online(b,ws) models for popular recency queries
only. It is expected that the online(b,ws) model achieves much more lift on popular recency
queries, since the batch(b) model cannot specialize well in such cases despite the relatively large
number of impressions. For example, the query "giant squid in California" was a popular recency
query. A batch(b) model well trained on historical click events still fails to foresee the popularity
and high relevance of the YouTube video, whereas the online(b,ws) model does so correctly by
adapting to users' click feedback; see Section 5.4 for more details of this example.

Figure 8: Relative lift in percentage (%) over the baseline model (frmsc) on various click metrics.

Figure 7 presents CTR@1 lifts over frmsc for queries with different lengths. Except for a tie
on two-word queries, the online(b,ws) model consistently outperforms the batch(b) model. The
results suggest that the online re-ranking method is robust to queries of various lengths.

5.3.3 Comparison of Other Click Metrics 

Although we have focused on training and evaluating our online re-ranking function based on
CTR@1, we also compare our results on the following other click metrics for ranking proposed in
prior work:

• Query CTR is the average number of clicks for each query.

• 1 − Abandonment Rate is the probability of a session receiving a click.

• Max RR (Reciprocal Rank) is the reciprocal rank of the highest ranked result clicked on.

• Mean RR (Reciprocal Rank) is the average of the clicked documents' reciprocal ranks.

• Min RR (Reciprocal Rank) is the reciprocal rank of the lowest ranked result clicked on.

Thus, for all metrics, higher values are assumed to indicate better ranking quality.
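
A minimal sketch of how these metrics could be computed from logged sessions is given below. It is illustrative only: the input layout (one list of clicked ranks per session), the function name `click_metrics`, and the choice to average the reciprocal-rank metrics over sessions with at least one click are assumptions of this example, not details stated in the text.

```python
def click_metrics(sessions):
    """Compute click metrics from per-session lists of clicked 1-based ranks.

    sessions: non-empty list of lists; an empty inner list is an abandoned
    session. RR metrics are averaged over sessions with at least one click
    (an assumption of this sketch).
    """
    if not sessions:
        raise ValueError("at least one session is required")
    n = len(sessions)
    clicked = [s for s in sessions if s]

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "query_ctr": sum(len(s) for s in sessions) / n,
        "one_minus_abandonment": len(clicked) / n,
        # The highest ranked click has the smallest rank number.
        "max_rr": avg([1.0 / min(s) for s in clicked]),
        "mean_rr": avg([sum(1.0 / r for r in s) / len(s) for s in clicked]),
        # The lowest ranked click has the largest rank number.
        "min_rr": avg([1.0 / max(s) for s in clicked]),
    }
```

With this convention a session clicked only at rank 1 contributes 1.0 to all three reciprocal-rank metrics, so moving the most-clicked document to the top raises every metric at once.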
Figure 9: Function values in the online model on the 4 URLs associated with the recency query
"giant squid in California". Note that the URL of metroactive.com was replaced by another URL
(not shown here) after the 23rd hour.



In Figure 8, both batch and online models show significant lifts over the frmsc baseline on
all click metrics listed above. Moreover, the online(b,ws) method consistently gives about 2%
more lift than batch(b). Thus, although our algorithms focus on maximizing CTR@1, they also give
simultaneous lifts on other click metrics. This fact shows that the easily measurable CTR@1 is
a good surrogate to optimize, and also justifies our choice of using clicks at the top position as user
feedback.

Remark: We have done similar experiments on non-recency queries as well, but our online
re-ranking scheme did not show as much gain for those queries as for the recency queries; the
results are omitted here. The absence of improvement for non-recency queries is expected: the
relevance of documents with respect to such queries does not change dramatically over time, so the
re-ranking based on click feedback may not be too different from the original ranking.



5.4 Case Study 

We now revisit our example query "giant squid in California" from Section 3 to illustrate how our
online re-ranking function can adapt to the click feedback quickly and track the best re-ranking.
Coincidentally, the query happened to appear only in our test set. We ran our online model,
online(b,ws), on the test set and recorded the function values of the 4 URLs. Figure 9(a) shows
the function values of the 4 URLs for the first 10 hours, and Figure 9(b) presents the entire temporal
curves of those function values over the lifetime of "giant squid in California".

Since the initial re-ranking function of online(b,ws) is (almost) identical to the fixed one of
batch(b), we can see from Figure 9(a) that when the query appears around the 5th hour, batch(b)
orders the four URLs as

1. foxnews.com/story/0,2933,290667,00.html

2. youtube.com/watch?v=I3ENZDFkAow

3. metroactive.com/metro/03.29.06/squid-0613.html

4. en.wikipedia.org/wiki/Giant_Squid_(band).

Note that this ranking is different from the frmsc ranking presented in Section 3. That is, although
batch(b) has not observed the query "giant squid in California" in its training sessions,
it was able to predict, from the sessions of other queries in the training set and based on the
query-document features, that the "video" page would attract many clicks for the query, and so improve the
original ranking. Nonetheless, we see that it still fails to accurately predict the users' click behavior.

On the other hand, given the users' click patterns in Figure 4, online(b,ws) promptly
learns from them and puts the "video" content with the highest CTR in the top rank within an
hour. The ranking was then maintained for the rest of the time. After the 25th hour, when
the impressions of the query quickly decreased toward 0, as shown in Figure 3, the function values
of online(b,ws) were kept intact, as can be seen in Figure 9(b).

Thus, this example illustrates how re-ranking algorithms may benefit from user click
feedback to improve ranking results. Furthermore, through real-time adaptation, online(b,ws) can quickly
learn from users' click patterns and outperform not only the editorial-based batch recency ranking,
frmsc, but also the click-based batch re-ranking, batch(b), for recency queries.



6 Related work 

As mentioned in Section 1, the machine-learned ranking framework has been extended to the
recency search problem. Previous work [8] introduced query classifiers to detect time-sensitive
queries, implemented time-varying features that reflect document freshness, and used recency-demoted
labels provided by human editors. More recently, improvements were made by introducing additional
click-related time-varying features [15, 9]. The algorithms studied in our work differ from these in two
ways: (i) we use user clicks as training labels to gain further improvement; (ii) instead of fixing the
ranking function and adjusting time-varying features, our features need not change over time, but
the parameters of the ranking function can be automatically adjusted in real time based on user
clicks.

Another related work [7] considered an online algorithm of slotting news direct display modules 
for recency queries in the search results page but did not consider re-rankings of documents. 

Using users' click feedback to improve ranking quality of a search engine has been extensively 
studied before. User behavior models {e.g., [W\^^14:\) are developed based on click log data, whose 
outputs were then used as features for batch-training a ranking function. However, it is not easy 
to reflect temporal variations of document relevance in these works since the features were often 
computed in an average sense. 

A few other works also used click data to directly modify their ranking based on inference
about the users' relative preferences on rankings, but their settings or focuses are different: the method
of [17] remained in the batch-learning mode and did not consider the temporal dimension of the
click data; [25] was similar in spirit to ours but did not consider strategies that generalize to tail
queries, and their results were based on simulated user clicks rather than real ones; finally, the
dueling bandit approach [29] required a special functionality of the retrieval system to interleave
two different ranking results.

Taking temporal variation of relevance into account to produce better rankings has also been
considered in the past. [11] considered the temporal variations of document content and applied
that knowledge to improve search ranking, but did not utilize click feedback to directly refine
the ranking. [19] devised a scheme to capture temporal dynamics of user ratings on items in a
collaborative filtering problem, but focused on long-term dynamics and did not consider
the cold-start problem, which is critical to our recency ranking application given the large volume
of tail queries. Personalized article recommendation on web portals is another closely related
problem. While models in earlier works (e.g., [1]) did not generalize, there have been efforts on
generalization more recently, such as the LinUCB algorithm [21], which uses a linear model similar
to ours, and the warm-start solution by [2]. However, both works maintain models for a
small number of articles/items, and so have not demonstrated the capacity to learn with an almost
infinitely large content pool, as in the space of query-document pairs in the search domain. Another
difference is that their model was more item-specific, whereas our model consists of both a global
model that applies to all queries and documents and a specific bias term for each query-document
pair. As a similar, independent thread of work, [28] also considered the large-scale personalized
recommendation problem, but imposed a Bayesian framework, which is different from our work.

7 Conclusions 

In this work, we investigated various learning algorithms to re-rank recency search results based
on real-time user feedback. Our contributions are three-fold. First, our evaluation method is novel
for web search: a random exploration bucket was used to collect user feedback, which not only
removed positional bias but also allowed us to reliably evaluate online learning algorithms offline.
Second, we proposed a re-ranking approach to improve current search results for recency queries,
and carried out an extensive empirical comparison of a dozen variants. Third, we demonstrated the
need for online learning as a flexible machine learning paradigm to adapt a ranking system
to time-varying document relevance.

In future work, we would like to explore other options for correcting position biases and using
clicks on multiple positions, e.g., multiplicative bias correction terms or user click
models (e.g., [5]), so that we can effectively increase the size of the training data, which may result
in faster learning in practice. In this work we focused on ranking documents based on
each individual document's CTR estimate. A much more challenging problem is to design algorithms that find
the best permutation of a set of documents, in which interactions between documents can be taken
into account.